{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Extracting data from Resumes\n",
    "\n",
    "Let us assume that we are running a hiring process for a company and we have received a list of resumes from candidates. We want to extract structured data from the resumes so that we can run a screening process and shortlist candidates. \n",
    "\n",
    "Take a look at one of the resumes in the `data/resumes` directory. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "        <iframe\n",
       "            width=\"600\"\n",
       "            height=\"400\"\n",
       "            src=\"./data/resumes/ai_researcher.pdf\"\n",
       "            frameborder=\"0\"\n",
       "            allowfullscreen\n",
       "            \n",
       "        ></iframe>\n",
       "        "
      ],
      "text/plain": [
       "<IPython.lib.display.IFrame at 0x109a7dcd0>"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from IPython.display import IFrame\n",
    "\n",
    "IFrame(src=\"./data/resumes/ai_researcher.pdf\", width=600, height=400)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You will notice that all the resumes have different layouts but contain common information like name, email, experience, education, etc. \n",
    "\n",
    "With LlamaExtract, we will show you how to:\n",
    "- *Define* a data schema to extract the information of interest. \n",
    "- *Iterate* over the data schema to generalize the schema for multiple resumes.\n",
    "- *Finalize* the schema and schedule extractions for multiple resumes.\n",
    "\n",
    "We will start by defining a `LlamaExtract` client which provides a Python interface to the LlamaExtract API. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from dotenv import load_dotenv\n",
    "from llama_cloud_services import LlamaExtract\n",
    "\n",
    "\n",
    "# Load environment variables (put LLAMA_CLOUD_API_KEY in your .env file)\n",
    "load_dotenv(override=True)\n",
    "\n",
    "# Optionally, add your project id/organization id\n",
    "llama_extract = LlamaExtract()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Defining the data schema\n",
    "\n",
    "Next, let us try to extract two fields from the resume: `name` and `email`. We can either use a Python dictionary structure to define the `data_schema` as a JSON or use a Pydantic model instead, for brevity and convenience. In either case, our output is guaranteed to validate against this schema."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pydantic import BaseModel, Field\n",
    "\n",
    "\n",
    "class Resume(BaseModel):\n",
    "    name: str = Field(description=\"The name of the candidate\")\n",
    "    email: str = Field(description=\"The email address of the candidate\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Uploading files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.20s/it]\n",
      "Creating extraction jobs: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.93s/it]\n",
      "Extracting files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.94s/it]\n",
      "Uploading files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.13it/s]\n",
      "Creating extraction jobs: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.80it/s]\n",
      "Extracting files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:15<00:00, 15.18s/it]\n",
      "Uploading files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.16it/s]\n",
      "Creating extraction jobs: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.33it/s]\n",
      "Extracting files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:32<00:00, 32.86s/it]\n"
     ]
    }
   ],
   "source": [
    "from llama_cloud.core.api_error import ApiError\n",
    "\n",
    "try:\n",
    "    existing_agent = llama_extract.get_agent(name=\"resume-screening\")\n",
    "    if existing_agent:\n",
    "        llama_extract.delete_agent(existing_agent.id)\n",
    "except ApiError as e:\n",
    "    if e.status_code == 404:\n",
    "        pass\n",
    "    else:\n",
    "        raise\n",
    "\n",
    "agent = llama_extract.create_agent(name=\"resume-screening\", data_schema=Resume)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[ExtractionAgent(id=1fef43b5-8230-43b4-9e80-c1cddf53889c, name=resume-screening),\n",
       " ExtractionAgent(id=93f8508b-3570-46f0-ae62-6315b40043bd, name=receipt/noisebridge_receipt.pdf_56db3d92),\n",
       " ExtractionAgent(id=08315f0e-7146-430b-99b8-9701cb3ace6a, name=receipt/noisebridge_receipt.pdf_5c4730a7),\n",
       " ExtractionAgent(id=cfcd7756-015d-4dbd-b142-a3eefcb16cd3, name=resume/software_architect_resume.html_4a11cf15),\n",
       " ExtractionAgent(id=17cb83d9-601e-4f5c-a7aa-286e3045bcb4, name=resume/software_architect_resume.html_0b7d84a8),\n",
       " ExtractionAgent(id=adc8e88c-44d3-4613-a5aa-d666ef007494, name=slide/saas_slide.pdf_bcc627a5),\n",
       " ExtractionAgent(id=189f14cd-6370-4476-a6ad-36eafbc62618, name=slide/saas_slide.pdf_065aa22b),\n",
       " ExtractionAgent(id=b9938ca5-6225-43cb-89ea-b0065237792f, name=test2),\n",
       " ExtractionAgent(id=574d37b8-59dc-41e9-bde0-5c506a8eb670, name=test)]"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "llama_extract.list_agents()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'name': 'Dr. Rachel Zhang', 'email': 'rachel.zhang@email.com'}"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "resume = agent.extract(\"./data/resumes/ai_researcher.pdf\")\n",
    "resume.data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Iterating over the data schema\n",
    "\n",
    "Now that we have created a data schema, let us add more fields to the schema. We will add `experience` and `education` fields to the schema. \n",
    "- We can create a new Pydantic model for each of these fields and represent `experience` and `education` as lists of these models. Doing this will allow us to extract multiple entities from the resume without having to pre-define how many experiences or education the candidate has. \n",
    "- We have added a `description` parameter to provide more context for extraction. We can use `description` to provide example inputs/outputs for the extraction. \n",
    "- Note that we have annotated the `start_date` and `end_date` fields with `Optional[str]` to indicate that these fields are optional. This is *important* because the schema will be used to extract data from multiple resumes and not all resumes will have the same format. A field must only be required if it is guaranteed to be present in all the resumes. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import List, Optional\n",
    "\n",
    "\n",
    "class Education(BaseModel):\n",
    "    institution: str = Field(description=\"The institution of the candidate\")\n",
    "    degree: str = Field(description=\"The degree of the candidate\")\n",
    "    start_date: Optional[str] = Field(\n",
    "        default=None, description=\"The start date of the candidate's education\"\n",
    "    )\n",
    "    end_date: Optional[str] = Field(\n",
    "        default=None, description=\"The end date of the candidate's education\"\n",
    "    )\n",
    "\n",
    "\n",
    "class Experience(BaseModel):\n",
    "    company: str = Field(description=\"The name of the company\")\n",
    "    title: str = Field(description=\"The title of the candidate\")\n",
    "    description: Optional[str] = Field(\n",
    "        default=None, description=\"The description of the candidate's experience\"\n",
    "    )\n",
    "    start_date: Optional[str] = Field(\n",
    "        default=None, description=\"The start date of the candidate's experience\"\n",
    "    )\n",
    "    end_date: Optional[str] = Field(\n",
    "        default=None, description=\"The end date of the candidate's experience\"\n",
    "    )\n",
    "\n",
    "\n",
    "class Resume(BaseModel):\n",
    "    name: str = Field(description=\"The name of the candidate\")\n",
    "    email: str = Field(description=\"The email address of the candidate\")\n",
    "    links: List[str] = Field(\n",
    "        description=\"The links to the candidate's social media profiles\"\n",
    "    )\n",
    "    experience: List[Experience] = Field(description=\"The candidate's experience\")\n",
    "    education: List[Education] = Field(description=\"The candidate's education\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we will update the `data_schema` for the `resume-screening` agent to use the new `Resume` model. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'name': 'Dr. Rachel Zhang',\n",
       " 'email': 'rachel.zhang@email.com',\n",
       " 'links': ['linkedin.com/in/rachelzhang',\n",
       "  'github.com/rzhang-ai',\n",
       "  'scholar.google.com/rachelzhang'],\n",
       " 'experience': [{'company': 'DeepMind',\n",
       "   'title': 'Senior Research Scientist',\n",
       "   'description': '- Lead researcher on large-scale multi-task learning systems, developing novel architectures that improve cross-task generalization by 40%\\n- Pioneered new approach to zero-shot learning using contrastive training, published in NeurIPS 2023\\n- Built and led team of 6 researchers working on foundational ML models\\n- Developed novel regularization techniques for large language models, reducing catastrophic forgetting by 35%',\n",
       "   'start_date': '2019',\n",
       "   'end_date': 'Present'},\n",
       "  {'company': 'Google Research',\n",
       "   'title': 'Research Scientist',\n",
       "   'description': '- Developed probabilistic frameworks for robust ML, published in ICML 2018\\n- Created novel attention mechanisms for computer vision models, improving accuracy by 25%\\n- Led collaboration with Google Brain team on efficient training methods for transformer models\\n- Mentored 4 PhD interns and collaborated with academic institutions',\n",
       "   'start_date': '2015',\n",
       "   'end_date': '2019'},\n",
       "  {'company': 'Columbia University',\n",
       "   'title': 'Research Assistant Professor',\n",
       "   'description': '- Published seminal work on Bayesian optimization methods (cited 1000+ times)\\n- Taught graduate-level courses in Machine Learning and Statistical Learning Theory\\n- Supervised 5 PhD students and 3 MSc students\\n- Secured $500K in research grants for probabilistic ML research',\n",
       "   'start_date': '2011',\n",
       "   'end_date': '2015'}],\n",
       " 'education': [{'institution': 'Columbia University',\n",
       "   'degree': 'Ph.D. in Computer Science',\n",
       "   'start_date': '2007',\n",
       "   'end_date': '2011'},\n",
       "  {'institution': 'Stanford University',\n",
       "   'degree': 'M.S. in Computer Science',\n",
       "   'start_date': '2005',\n",
       "   'end_date': '2007'}]}"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "agent.data_schema = Resume\n",
    "resume = agent.extract(\"./data/resumes/ai_researcher.pdf\")\n",
    "resume.data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is a good start. Let us add a few more fields to the schema and re-run the extraction. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class TechnicalSkills(BaseModel):\n",
    "    programming_languages: List[str] = Field(\n",
    "        description=\"The programming languages the candidate is proficient in.\"\n",
    "    )\n",
    "    frameworks: List[str] = Field(\n",
    "        description=\"The tools/frameworks the candidate is proficient in, e.g. React, Django, PyTorch, etc.\"\n",
    "    )\n",
    "    skills: List[str] = Field(\n",
    "        description=\"Other general skills the candidate is proficient in, e.g. Data Engineering, Machine Learning, etc.\"\n",
    "    )\n",
    "\n",
    "\n",
    "class Resume(BaseModel):\n",
    "    name: str = Field(description=\"The name of the candidate\")\n",
    "    email: str = Field(description=\"The email address of the candidate\")\n",
    "    links: List[str] = Field(\n",
    "        description=\"The links to the candidate's social media profiles\"\n",
    "    )\n",
    "    experience: List[Experience] = Field(description=\"The candidate's experience\")\n",
    "    education: List[Education] = Field(description=\"The candidate's education\")\n",
    "    technical_skills: TechnicalSkills = Field(\n",
    "        description=\"The candidate's technical skills\"\n",
    "    )\n",
    "    key_accomplishments: str = Field(\n",
    "        description=\"Summarize the candidates highest achievements.\"\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'name': 'Dr. Rachel Zhang, Ph.D.',\n",
       " 'email': 'rachel.zhang@email.com',\n",
       " 'links': ['linkedin.com/in/rachelzhang',\n",
       "  'github.com/rzhang-ai',\n",
       "  'scholar.google.com/rachelzhang'],\n",
       " 'experience': [{'company': 'DeepMind',\n",
       "   'title': 'Senior Research Scientist',\n",
       "   'description': 'Lead researcher on large-scale multi-task learning systems, developing novel architectures that improve cross-task generalization by 40%\\nPioneered new approach to zero-shot learning using contrastive training, published in NeurIPS 2023\\nBuilt and led team of 6 researchers working on foundational ML models\\nDeveloped novel regularization techniques for large language models, reducing catastrophic forgetting by 35%',\n",
       "   'start_date': '2019',\n",
       "   'end_date': 'Present'},\n",
       "  {'company': 'Google Research',\n",
       "   'title': 'Research Scientist',\n",
       "   'description': 'Developed probabilistic frameworks for robust ML, published in ICML 2018\\nCreated novel attention mechanisms for computer vision models, improving accuracy by 25%\\nLed collaboration with Google Brain team on efficient training methods for transformer models\\nMentored 4 PhD interns and collaborated with academic institutions',\n",
       "   'start_date': '2015',\n",
       "   'end_date': '2019'},\n",
       "  {'company': 'Columbia University',\n",
       "   'title': 'Research Assistant Professor',\n",
       "   'description': 'Published seminal work on Bayesian optimization methods (cited 1000+ times)\\nTaught graduate-level courses in Machine Learning and Statistical Learning Theory\\nSupervised 5 PhD students and 3 MSc students\\nSecured $500K in research grants for probabilistic ML research',\n",
       "   'start_date': '2011',\n",
       "   'end_date': '2015'}],\n",
       " 'education': [{'institution': 'Columbia University',\n",
       "   'degree': 'Ph.D. in Computer Science',\n",
       "   'start_date': '2007',\n",
       "   'end_date': '2011'},\n",
       "  {'institution': 'Stanford University',\n",
       "   'degree': 'M.S. in Computer Science',\n",
       "   'start_date': '2005',\n",
       "   'end_date': '2007'}],\n",
       " 'technical_skills': {'programming_languages': ['Python',\n",
       "   'C++',\n",
       "   'Julia',\n",
       "   'CUDA'],\n",
       "  'frameworks': ['PyTorch', 'TensorFlow', 'JAX', 'Ray'],\n",
       "  'skills': ['Deep Learning',\n",
       "   'Reinforcement Learning',\n",
       "   'Probabilistic Models',\n",
       "   'Multi-Task Learning',\n",
       "   'Zero-Shot Learning',\n",
       "   'Neural Architecture Search']},\n",
       " 'key_accomplishments': 'AI researcher with 12+ years of experience spanning classical machine learning, deep learning, and probabilistic modeling. Led groundbreaking research in reinforcement learning, generative models, and multi-task learning. Published 25+ papers in top-tier conferences (NeurIPS, ICML, ICLR). Strong track record of transitioning theoretical advances into practical applications in both academic and industrial settings.'}"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "agent.data_schema = Resume\n",
    "resume = agent.extract(\"./data/resumes/ai_researcher.pdf\")\n",
    "resume.data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Finalizing the schema\n",
    "\n",
    "This is great! We have extracted a lot of key information from the resume that is well-typed and can be used downstream for further processing. Until now, this data is ephemeral and will be lost if we close the session. Let us save the state of our extraction and use it to extract data from multiple resumes. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "agent.save()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'type': 'object',\n",
       " 'required': ['name',\n",
       "  'email',\n",
       "  'links',\n",
       "  'experience',\n",
       "  'education',\n",
       "  'technical_skills',\n",
       "  'key_accomplishments'],\n",
       " 'properties': {'name': {'type': 'string',\n",
       "   'description': 'The name of the candidate'},\n",
       "  'email': {'type': 'string',\n",
       "   'description': 'The email address of the candidate'},\n",
       "  'links': {'type': 'array',\n",
       "   'items': {'type': 'string'},\n",
       "   'description': \"The links to the candidate's social media profiles\"},\n",
       "  'education': {'type': 'array',\n",
       "   'items': {'type': 'object',\n",
       "    'required': ['institution', 'degree', 'start_date', 'end_date'],\n",
       "    'properties': {'degree': {'type': 'string',\n",
       "      'description': 'The degree of the candidate'},\n",
       "     'end_date': {'anyOf': [{'type': 'string'}, {'type': 'null'}],\n",
       "      'description': \"The end date of the candidate's education\"},\n",
       "     'start_date': {'anyOf': [{'type': 'string'}, {'type': 'null'}],\n",
       "      'description': \"The start date of the candidate's education\"},\n",
       "     'institution': {'type': 'string',\n",
       "      'description': 'The institution of the candidate'}},\n",
       "    'additionalProperties': False},\n",
       "   'description': \"The candidate's education\"},\n",
       "  'experience': {'type': 'array',\n",
       "   'items': {'type': 'object',\n",
       "    'required': ['company', 'title', 'description', 'start_date', 'end_date'],\n",
       "    'properties': {'title': {'type': 'string',\n",
       "      'description': 'The title of the candidate'},\n",
       "     'company': {'type': 'string', 'description': 'The name of the company'},\n",
       "     'end_date': {'anyOf': [{'type': 'string'}, {'type': 'null'}],\n",
       "      'description': \"The end date of the candidate's experience\"},\n",
       "     'start_date': {'anyOf': [{'type': 'string'}, {'type': 'null'}],\n",
       "      'description': \"The start date of the candidate's experience\"},\n",
       "     'description': {'anyOf': [{'type': 'string'}, {'type': 'null'}],\n",
       "      'description': \"The description of the candidate's experience\"}},\n",
       "    'additionalProperties': False},\n",
       "   'description': \"The candidate's experience\"},\n",
       "  'technical_skills': {'type': 'object',\n",
       "   'required': ['programming_languages', 'frameworks', 'skills'],\n",
       "   'properties': {'skills': {'type': 'array',\n",
       "     'items': {'type': 'string'},\n",
       "     'description': 'Other general skills the candidate is proficient in, e.g. Data Engineering, Machine Learning, etc.'},\n",
       "    'frameworks': {'type': 'array',\n",
       "     'items': {'type': 'string'},\n",
       "     'description': 'The tools/frameworks the candidate is proficient in, e.g. React, Django, PyTorch, etc.'},\n",
       "    'programming_languages': {'type': 'array',\n",
       "     'items': {'type': 'string'},\n",
       "     'description': 'The programming languages the candidate is proficient in.'}},\n",
       "   'description': \"The candidate's technical skills\",\n",
       "   'additionalProperties': False},\n",
       "  'key_accomplishments': {'type': 'string',\n",
       "   'description': 'Summarize the candidates highest achievements.'}},\n",
       " 'additionalProperties': False}"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "agent = llama_extract.get_agent(\"resume-screening\")\n",
    "agent.data_schema  # Latest schema should be returned"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Queueing extractions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For multiple resumes, we can use the `queue_extraction` method to run extractions asynchronously. This is ideal for processing batch extraction jobs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Uploading files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  2.13it/s]\n",
      "Creating extraction jobs: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  5.83it/s]\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "\n",
    "# All resumes in the data/resumes directory\n",
    "resumes = []\n",
    "\n",
    "with os.scandir(\"./data/resumes\") as entries:\n",
    "    for entry in entries:\n",
    "        if entry.is_file():\n",
    "            resumes.append(entry.path)\n",
    "\n",
    "jobs = await agent.queue_extraction(resumes)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get the latest status of the extractions for any `job_id`, we can use the `get_extraction_job` method. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<StatusEnum.PENDING: 'PENDING'>,\n",
       " <StatusEnum.PENDING: 'PENDING'>,\n",
       " <StatusEnum.PENDING: 'PENDING'>]"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[agent.get_extraction_job(job_id=job.id).status for job in jobs]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We notice that all extraction runs are in a PENDING state. We can check back again to see if the extractions have completed. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<StatusEnum.SUCCESS: 'SUCCESS'>,\n",
       " <StatusEnum.SUCCESS: 'SUCCESS'>,\n",
       " <StatusEnum.SUCCESS: 'SUCCESS'>]"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[agent.get_extraction_job(job_id=job.id).status for job in jobs]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Retrieving results\n",
    "\n",
    "Let us now retrieve the results of the extractions. If the status of the extraction is `SUCCESS`, we can retrieve the data from the `data` field. In case there are errors (status = `ERROR`), we can retrieve the error message from the `error` field. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "results = []\n",
    "for job in jobs:\n",
    "    extract_run = agent.get_extraction_run_for_job(job.id)\n",
    "    if extract_run.status == \"SUCCESS\":\n",
    "        results.append(extract_run.data)\n",
    "    else:\n",
    "        print(f\"Extraction status for job {job.id}: {extract_run.status}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'name': 'Dr. Rachel Zhang, Ph.D.',\n",
       " 'email': 'rachel.zhang@email.com',\n",
       " 'links': ['linkedin.com/in/rachelzhang',\n",
       "  'github.com/rzhang-ai',\n",
       "  'scholar.google.com/rachelzhang'],\n",
       " 'education': [{'degree': 'Ph.D. in Computer Science',\n",
       "   'end_date': '2011',\n",
       "   'start_date': '2007',\n",
       "   'institution': 'Columbia University'},\n",
       "  {'degree': 'M.S. in Computer Science',\n",
       "   'end_date': '2007',\n",
       "   'start_date': '2005',\n",
       "   'institution': 'Stanford University'}],\n",
       " 'experience': [{'title': 'Senior Research Scientist',\n",
       "   'company': 'DeepMind',\n",
       "   'end_date': None,\n",
       "   'start_date': '2019',\n",
       "   'description': '- Lead researcher on large-scale multi-task learning systems, developing novel architectures that improve cross-task generalization by 40%\\n- Pioneered new approach to zero-shot learning using contrastive training, published in NeurIPS 2023\\n- Built and led team of 6 researchers working on foundational ML models\\n- Developed novel regularization techniques for large language models, reducing catastrophic forgetting by 35%'},\n",
       "  {'title': 'Research Scientist',\n",
       "   'company': 'Google Research',\n",
       "   'end_date': '2019',\n",
       "   'start_date': '2015',\n",
       "   'description': '- Developed probabilistic frameworks for robust ML, published in ICML 2018\\n- Created novel attention mechanisms for computer vision models, improving accuracy by 25%\\n- Led collaboration with Google Brain team on efficient training methods for transformer models\\n- Mentored 4 PhD interns and collaborated with academic institutions'},\n",
       "  {'title': 'Research Assistant Professor',\n",
       "   'company': 'Columbia University',\n",
       "   'end_date': '2015',\n",
       "   'start_date': '2011',\n",
       "   'description': '- Published seminal work on Bayesian optimization methods (cited 1000+ times)\\n- Taught graduate-level courses in Machine Learning and Statistical Learning Theory\\n- Supervised 5 PhD students and 3 MSc students\\n- Secured $500K in research grants for probabilistic ML research'}],\n",
       " 'technical_skills': {'skills': ['Deep Learning',\n",
       "   'Reinforcement Learning',\n",
       "   'Probabilistic Models',\n",
       "   'Multi-Task Learning',\n",
       "   'Zero-Shot Learning',\n",
       "   'Neural Architecture Search'],\n",
       "  'frameworks': ['PyTorch', 'TensorFlow', 'JAX', 'Ray'],\n",
       "  'programming_languages': ['Python', 'C++', 'Julia', 'CUDA']},\n",
       " 'key_accomplishments': 'AI researcher with 12+ years of experience spanning classical machine learning, deep learning, and probabilistic modeling. Led groundbreaking research in reinforcement learning, generative models, and multi-task learning. Published 25+ papers in top-tier conferences (NeurIPS, ICML, ICLR). Strong track record of transitioning theoretical advances into practical applications in both academic and industrial settings.'}"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "results[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'name': 'Alex Park',\n",
       " 'email': 'alex park@email.com',\n",
       " 'links': ['linkedin.com/in/alexpark'],\n",
       " 'education': [{'degree': 'M.S. Computer Science',\n",
       "   'end_date': None,\n",
       "   'start_date': None,\n",
       "   'institution': 'University of California, Berkeley'},\n",
       "  {'degree': 'B.S. Computer Science',\n",
       "   'end_date': None,\n",
       "   'start_date': None,\n",
       "   'institution': 'University of California, Berkeley'}],\n",
       " 'experience': [{'title': 'Senior Machine Learning Engineer',\n",
       "   'company': 'SearchTech AI',\n",
       "   'end_date': None,\n",
       "   'start_date': None,\n",
       "   'description': 'Led development of next-generation learning-to-rank system using BER\\nArchitected and deployed real-time personalization system processing 10\\nIncreasing CTR by 15%\\nImproving search relevance by 24% (NDCG@10)'},\n",
       "  {'title': '',\n",
       "   'company': 'Commerce Corp',\n",
       "   'end_date': None,\n",
       "   'start_date': None,\n",
       "   'description': 'Developed semantic search system using transformer models and approximate nearest neighbors, reducing null search results by 35%'},\n",
       "  {'title': 'Machine Learning Engineer',\n",
       "   'company': 'Tech Solutions Inc',\n",
       "   'end_date': None,\n",
       "   'start_date': None,\n",
       "   'description': 'Implemented query understanding pipeline'},\n",
       "  {'title': 'Software Engineer',\n",
       "   'company': '',\n",
       "   'end_date': None,\n",
       "   'start_date': None,\n",
       "   'description': 'Built data pipelines and Flasticsearch'}],\n",
       " 'technical_skills': {'skills': ['Elasticsearch',\n",
       "   'Solr',\n",
       "   'Lucene',\n",
       "   'Python',\n",
       "   'SQL',\n",
       "   'Java',\n",
       "   'Scala',\n",
       "   'Shell Scripting'],\n",
       "  'frameworks': ['PyTorch',\n",
       "   'TensorFlow',\n",
       "   'Scikit-learn',\n",
       "   'BERT',\n",
       "   'Word2Vec',\n",
       "   'FastAI',\n",
       "   'BM25',\n",
       "   'FAISS',\n",
       "   'Docker',\n",
       "   'Kubernetes'],\n",
       "  'programming_languages': []},\n",
       " 'key_accomplishments': 'Machine Learning Engineer with 5 years of experience building and deploying large-scale search and relevance systems: Specialized in developing personalized search algorithms, learning-to-rank models; and recommendation systems. Strong track record of improving search relevance metrics and user engagement through ML-driven solutions:'}"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "results[1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'name': 'Sarah Chen',\n",
       " 'email': 'sarah.chen@email.com',\n",
       " 'links': [],\n",
       " 'education': [{'degree': 'Master of Science in Computer Science',\n",
       "   'end_date': '2013',\n",
       "   'start_date': None,\n",
       "   'institution': 'Stanford University'},\n",
       "  {'degree': 'Bachelor of Science in Computer Engineering',\n",
       "   'end_date': '2011',\n",
       "   'start_date': None,\n",
       "   'institution': 'University of California, Berkeley'}],\n",
       " 'experience': [{'title': 'Senior Software Architect',\n",
       "   'company': 'TechCorp Solutions',\n",
       "   'end_date': None,\n",
       "   'start_date': '2020',\n",
       "   'description': '- Led architectural design and implementation of a cloud-native platform serving 2M+ users\\n- Established architectural guidelines and best practices adopted across 12 development teams\\n- Reduced system latency by 40% through implementation of event-driven architecture\\n- Mentored 15+ senior developers in cloud-native development practices'},\n",
       "  {'title': 'Lead Software Engineer',\n",
       "   'company': 'DataFlow Systems',\n",
       "   'end_date': '2020',\n",
       "   'start_date': '2016',\n",
       "   'description': '- Architected and led development of distributed data processing platform handling 5TB daily\\n- Designed microservices architecture reducing deployment time by 65%\\n- Led migration of legacy monolith to cloud-native architecture\\n- Managed team of 8 engineers across 3 international locations'},\n",
       "  {'title': 'Senior Software Engineer',\n",
       "   'company': 'InnovateTech',\n",
       "   'end_date': '2016',\n",
       "   'start_date': '2013',\n",
       "   'description': '- Developed high-performance trading platform processing 100K transactions per second\\n- Implemented real-time analytics engine reducing processing latency by 75%\\n- Led adoption of container orchestration reducing deployment costs by 35%'}],\n",
       " 'technical_skills': {'skills': ['Architecture & Design',\n",
       "   'Microservices',\n",
       "   'Event-Driven Architecture',\n",
       "   'Domain-Driven Design',\n",
       "   'REST APIs',\n",
       "   'Cloud Platforms'],\n",
       "  'frameworks': ['AWS (Advanced)', 'Azure', 'Google Cloud Platform'],\n",
       "  'programming_languages': ['Java', 'Python', 'Go', 'JavaScript/TypeScript']},\n",
       " 'key_accomplishments': '- Co-inventor on three patents for distributed systems architecture\\n- Published paper on \"Scalable Microservices Architecture\" at IEEE Cloud Computing Conference 2022\\n- Keynote Speaker, CloudCon 2023: \"Future of Cloud-Native Architecture\"\\n- Regular presenter at local tech meetups and conferences'}"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "results[2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Congratulations! You now have an agent that can extract structured data from resumes. \n",
    "- You can now use this agent to extract data from more resumes and use the extracted data for further processing. \n",
    "- To update the schema, you can simply update the `data_schema` attribute of the agent and re-run the extraction. \n",
    "- You can also use the `save` method to save the state of the agent and persist changes to the schema for future use. \n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
